Algorithms for bigram and trigram word clustering

Authors

  • Sven C. Martin
  • Jörg Liermann
  • Hermann Ney
Abstract

Sven Martin, Jörg Liermann, Hermann Ney. Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, D-52056 Aachen, Germany.

This paper presents and analyzes improved algorithms for clustering bigram and trigram word equivalence classes, and their respective results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large corpora (38 million words and more) can be clustered within a small number of days or even hours. 3) We extend the clustering approach from bigrams to trigrams. 4) We present experimental results on a 38 million word training corpus.
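As an illustration of what bigram word clustering involves, here is a minimal sketch of an exchange-style procedure that moves one word at a time to the class that most increases a class-bigram log-likelihood. The abstract does not spell out the algorithm, so the criterion, the function names (`class_objective`, `exchange_cluster`), the round-robin initialization, and the stopping rule are all assumptions made for this example. It recomputes the objective from scratch for every candidate move, which is only workable on toy data and is precisely the kind of cost an efficient implementation would avoid.

```python
import math
from collections import Counter


def class_objective(bigrams, cls):
    """Class-dependent part of the class-bigram log-likelihood (assumed criterion).

    bigrams maps (v, w) word pairs to counts; cls maps each word to a class id.
    """
    pair, pred, succ = Counter(), Counter(), Counter()
    for (v, w), n in bigrams.items():
        gv, gw = cls[v], cls[w]
        pair[(gv, gw)] += n   # class-pair count N(g_v, g_w)
        pred[gv] += n         # count of g_v as predecessor class
        succ[gw] += n         # count of g_w as successor class

    def xlogx(counts):
        return sum(n * math.log(n) for n in counts.values())

    return xlogx(pair) - xlogx(pred) - xlogx(succ)


def exchange_cluster(tokens, num_classes, max_iters=10):
    """Naive exchange clustering: try every class for every word, keep the best."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = sorted(set(tokens))
    # Hypothetical initialization: assign words to classes round-robin.
    cls = {w: i % num_classes for i, w in enumerate(vocab)}
    for _ in range(max_iters):
        moved = False
        for w in vocab:
            original = cls[w]
            best_class = original
            best_score = class_objective(bigrams, cls)
            for g in range(num_classes):
                if g == original:
                    continue
                cls[w] = g  # tentatively move w to class g
                score = class_objective(bigrams, cls)
                if score > best_score + 1e-12:
                    best_class, best_score = g, score
            cls[w] = best_class  # commit the best move (or stay put)
            moved = moved or best_class != original
        if not moved:
            break  # no word changed class during a full pass
    return cls, bigrams
```

Calling `exchange_cluster(tokens, num_classes)` on a tokenized corpus returns the word-to-class map together with the bigram counts. Because a word only moves when the objective strictly improves, the objective never decreases from one pass to the next.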


Related resources

Enhanced word classing for model M

Model M is a superior class-based n-gram model that has shown improvements on a variety of tasks and domains. In previous work with Model M, bigram mutual information clustering has been used to derive word classes. In this paper, we introduce a new word classing method designed to closely match with Model M. The proposed classing technique achieves gains in speech recognition word-error rate o...


New Developments in Lattice-Based Search Strategies in SRI’s Hub4 System

We describe new developments in SRI’s lattice-based progressive search strategy. These developments include the implementation of a new bigram lattice algorithm, lattice optimization techniques, and expansion of bigram lattices to trigram lattices. The new bigram lattice generation algorithm is based on generation of backtrace entries using a word-dependent N-best list decoding pass, followed b...


Building and Incorporating Language Models for Persian Continuous Speech Recognition Systems

This paper describes building statistical language models for Persian from a corpus and incorporating them into a Persian continuous speech recognition (CSR) system. We used the Persian Text Corpus to build the language models. First, we preprocessed the corpus texts by correcting the differing orthography of words. Also, the number of POS tags was decreased by clustering POS tag...


Efficient lattice representation and generation

In large-vocabulary, multi-pass speech recognition systems, it is desirable to generate word lattices incorporating a large number of hypotheses while keeping the lattice sizes small. We describe two new techniques for reducing word lattice sizes without eliminating hypotheses. The first technique is an algorithm to reduce the size of non-deterministic bigram word lattices. The algorithm iterat...


Class phrase models for language modelling

Previous attempts to automatically determine multi-words as the basic unit for language modeling have been successful for extending bigram models [10, 9, 2, 8] to improve the perplexity of the language model and/or the word accuracy of the speech decoder. However, none of these techniques gave improvements over the trigram model so far, except for the rather controlled ATIS task [8]. We therefor...



Journal title:
  • Speech Communication

Volume 24  Issue 

Pages  -

Publication date 1995